
feat: /discover command — data stack setup with project_scan tool#30

Merged
anandgupta42 merged 7 commits into main from feat/init-command
Mar 4, 2026

Conversation

@anandgupta42 (Contributor)

Summary

  • Replaces the AGENTS.md-generating /init command with a comprehensive data stack scanner
  • New project_scan tool detects dbt projects, warehouse connections (from dbt profiles, Docker, env vars), installed tools, and config files
  • Updated /init template guides the AI agent through a 5-step setup flow: scan → review → add connections → index schemas → show next steps
  • Documentation updated across getting-started, warehouse tools, commands, and TUI pages

New Files

| File | Purpose |
| --- | --- |
| src/tool/project-scan.ts | project_scan tool with 5 exported detection functions + connection deduplication |
| test/tool/project-scan.test.ts | 71 bun:test cases (detectGit, detectDbtProject, detectEnvVars, parseToolVersion, detectDataTools, detectConfigFiles) |
| altimate-engine/tests/test_env_detect.py | 24 pytest cases for env var → warehouse mapping |

Modified Files

| File | Change |
| --- | --- |
| src/tool/registry.ts | Register ProjectScanTool |
| src/command/template/initialize.txt | New 5-step data stack setup prompt |
| src/command/index.ts | Description: "scan data stack and set up connections" |
| docs/docs/getting-started.md | /init as first-run command |
| docs/docs/data-engineering/tools/warehouse-tools.md | Added project_scan, warehouse_add, warehouse_remove, warehouse_discover docs |
| docs/docs/data-engineering/tools/index.md | Updated warehouse tools count (2 → 6) |
| docs/docs/configure/commands.md | Documented built-in /init and /review commands |
| docs/docs/usage/tui.md | Added /init to slash command examples |

What project_scan detects

| Category | Method |
| --- | --- |
| Git | branch, remote URL via git commands |
| dbt project | Walks up for dbt_project.yml, parses manifest |
| dbt profiles | Bridge call to ~/.dbt/profiles.yml |
| Docker DBs | Bridge call for running Postgres/MySQL/MSSQL containers |
| Existing connections | Bridge call to list configured warehouses |
| Env vars | Snowflake, BigQuery, Databricks, Postgres, MySQL, Redshift |
| Schema cache | Bridge call for indexed warehouse status |
| Data tools | dbt, sqlfluff, airflow, dagster, prefect, soda, sqlmesh, great_expectations, sqlfmt |
| Config files | .altimate-code/, .sqlfluff, .pre-commit-config.yaml |
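As a rough illustration of the env var detection row above, a minimal sketch of signal-table matching (the `ENV_SIGNALS` table and function name here are hypothetical and much smaller than the tool's actual tables):

```python
# Hypothetical signal tables for illustration only; the real project_scan
# tool covers Snowflake, BigQuery, Databricks, Postgres, MySQL, and Redshift
# with larger env var lists.
ENV_SIGNALS = {
    "snowflake": ["SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER"],
    "bigquery": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "postgres": ["PGHOST", "POSTGRES_HOST"],
}

def detect_env_connections(env: dict) -> list:
    """Return one candidate connection per warehouse type with a matching signal."""
    found = []
    for wh_type, signals in ENV_SIGNALS.items():
        # First env var that is present and non-empty counts as the signal.
        matched = next((s for s in signals if env.get(s)), None)
        if matched:
            found.append({"type": wh_type, "source": "env-var", "signal": matched})
    return found
```

Each hit records its `source` so the deduplication step can prefer richer sources (e.g. dbt profiles) over bare env var hints.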

Test plan

  • bun test — 1220 pass, 0 fail (full suite including 71 new tests)
  • pytest tests/test_env_detect.py — 24 pass
  • mkdocs build --strict — docs build cleanly
  • Manual: run /init in TUI from a dbt project directory
  • Manual: run /init in TUI from a bare directory (no dbt)
  • Manual: verify connections discovered from env vars get offered for setup

🤖 Generated with Claude Code

@anandgupta42 changed the title from "feat: /init command — data stack setup with project_scan tool" to "feat: /discover command — data stack setup with project_scan tool" on Mar 4, 2026
anandgupta42 and others added 5 commits March 3, 2026 19:11
feat: replace /init with data stack setup command

Replace the AGENTS.md-generating /init command with a comprehensive data
stack scanner that detects dbt projects, warehouse connections, Docker
databases, installed tools, and config files. The AI agent then walks
the user through adding connections, testing them, and indexing schemas.

New project_scan tool with 5 exported detection functions:
- detectGit: branch, remote URL
- detectDbtProject: dbt_project.yml, manifest, packages
- detectEnvVars: Snowflake, BigQuery, Databricks, Postgres, MySQL, Redshift
- detectDataTools: dbt, sqlfluff, airflow, dagster, prefect, soda, sqlmesh, great_expectations, sqlfmt
- detectConfigFiles: .altimate-code/, .sqlfluff, .pre-commit-config.yaml

Tests: 71 TypeScript (bun:test) + 24 Python (pytest)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
refactor: restore /init, rename data stack setup to /discover

Restore the original /init command (creates AGENTS.md) and move the
data stack setup functionality to /discover instead. Updates all docs
to reference /discover as the recommended first-run command.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: make detectGit test resilient to CI detached HEAD

GitHub Actions checks out in detached HEAD state, so
git branch --show-current returns empty. The test now
accepts undefined branch in that case.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: treat empty git branch as undefined in detectGit

In CI detached HEAD, git branch --show-current returns an empty
string. Convert empty string to undefined so callers get a clean
undefined instead of "".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
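The empty-branch handling described in the commit above reduces to a tiny normalization step; a sketch (the function name is illustrative, and the TypeScript original returns `undefined` where this returns `None`):

```python
def normalize_branch(raw: str):
    """Map `git branch --show-current` output to a branch name or None.

    In CI detached HEAD the command prints an empty string; callers should
    see None rather than "".
    """
    branch = raw.strip()
    return branch or None  # empty string -> None
```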
fix: remove branch type assertion from detectGit repo test

The "detects a git repository" test should only assert isRepo.
Branch validation is handled by the dedicated branch test which
accounts for CI detached HEAD returning undefined.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
stderr: "pipe",
})
if (isRepoResult.exitCode !== 0) {
return { isRepo: false }


thought: detectGit (and detectDataTools) use Bun.spawnSync directly — this couples the tool to the Bun runtime. If there's ever a need to run under Node.js (or test with a different runner), these would need to be swapped for child_process.spawnSync. Fine for now since the project is Bun-only, but worth noting.

type: wh.type,
source: "env-var",
signal: matchedSignal,
config,


suggestion: The config object captures raw env var values including password, access_token, etc. These get returned to the LLM in the tool output. Consider either redacting sensitive fields (replace with "***") or only including non-secret fields in the scan output. The connection setup step (warehouse_add) can read the env vars directly when actually configuring.
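The suggested redaction can be sketched in a few lines; the key set here is hypothetical (the PR's actual SENSITIVE_KEYS set may differ):

```python
# Hypothetical key set for illustration; match against lowercase config keys.
SENSITIVE_KEYS = {"password", "access_token", "connection_string", "private_key_path"}

def redact(config: dict) -> dict:
    """Mask secret-bearing values so raw credentials never reach the LLM."""
    return {k: ("***" if k in SENSITIVE_KEYS else v) for k, v in config.items()}
```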

database: "REDSHIFT_DATABASE",
user: "REDSHIFT_USER",
password: "REDSHIFT_PASSWORD",
},


thought: DATABASE_URL is used as a Postgres signal, but many frameworks (Rails, Django, etc.) set DATABASE_URL for any database type including MySQL, SQLite, etc. This could produce false positives. Consider parsing the URL scheme (postgresql:// vs mysql://) before categorizing, or at least noting the assumption in the output.
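Scheme-based classification, as suggested here, might look like the following sketch (the mapping covers the schemes named elsewhere in this PR; the function name is illustrative):

```python
from urllib.parse import urlsplit

# Scheme -> warehouse type; unknown schemes return None instead of a
# false-positive "postgres" classification.
DATABASE_URL_SCHEME_MAP = {
    "postgresql": "postgres",
    "postgres": "postgres",
    "mysql": "mysql",
    "mysql2": "mysql",
    "redshift": "redshift",
    "sqlite": "sqlite",
}

def database_url_type(url: str):
    """Classify DATABASE_URL by its URL scheme instead of assuming Postgres."""
    return DATABASE_URL_SCHEME_MAP.get(urlsplit(url).scheme.lower())
```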

for (const tool of DATA_TOOL_NAMES) {
try {
const result = Bun.spawnSync([tool, "--version"], {
stdout: "pipe",


nit: The 9 tool version checks run sequentially. Since each has a 5s timeout, worst case is 45s. Could use Promise.all to run them in parallel and cut the scan time significantly.
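The same parallelization idea, sketched in Python with a thread pool rather than the PR's Promise.all (function names are illustrative):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def check_version(tool: str):
    """Return `tool --version` output, or None if the tool is missing or slow."""
    try:
        result = subprocess.run(
            [tool, "--version"], capture_output=True, text=True, timeout=5
        )
        return result.stdout.strip() or None
    except (FileNotFoundError, subprocess.TimeoutExpired, OSError):
        return None

def check_all(tools):
    """Run all version checks concurrently: worst case ~5s instead of 5s * N."""
    with ThreadPoolExecutor(max_workers=len(tools) or 1) as pool:
        return dict(zip(tools, pool.map(check_version, tools)))
```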


@jontsai left a comment


LGTM — solid feature addition with excellent test coverage (71 TS + 24 Python tests).

What's good:

  • Clean separation of detection functions, each independently testable
  • Thoughtful connection deduplication logic across sources
  • Resilient CI handling (detached HEAD edge case)
  • Well-structured /discover flow that guides users step by step
  • Docs are comprehensive and match the implementation

Minor items (posted inline):

  • Env var detection captures secrets (passwords, tokens) in scan output — consider redacting
  • DATABASE_URL assumed to be Postgres but could be any DB type
  • Tool version checks are sequential; parallelizing would speed up scans
  • Bun.spawnSync coupling is fine for now but noted for awareness

None are blockers. Ship it! 🚀

fix: address PR review — redact secrets, parse DATABASE_URL scheme, parallelize tool checks

- Redact sensitive env var values (password, access_token, connection_string)
  in scan output with "***" so secrets are never sent to the LLM
- Parse DATABASE_URL scheme (postgresql://, mysql://, etc.) to detect the
  correct database type instead of assuming Postgres
- Parallelize tool version checks with Promise.all instead of sequential loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@setu-altimateai (bot) left a comment


Nice work addressing the previous review comments! The secret redaction and DATABASE_URL scheme parsing in the TypeScript implementation look solid. One issue with the Python test file — see inline comment. Overall LGTM pending that fix.

"schema": "SNOWFLAKE_SCHEMA",
"role": "SNOWFLAKE_ROLE",
},
},


issue: This Python reference implementation is now out of sync with the TypeScript version in two ways:

  1. DATABASE_URL is still listed as a postgres signal (line 28), but the TS version removed it from postgres signals and handles it separately with scheme-based type detection (postgresql:// → postgres, mysql:// → mysql, etc.). This means the Python version will always classify DATABASE_URL as postgres regardless of the actual scheme.

  2. No secret redaction — the TS version now masks sensitive keys (password, access_token, connection_string) with "***", but the Python detect_env_connections() returns raw values. If this Python implementation is used later, it would leak secrets.

Since the docstring says "mirrors TypeScript detectEnvVars", these should be kept in sync.

- Remove DATABASE_URL from postgres signals, add scheme-based detection
- Add secret redaction (password, access_token, connection_string)
- Add tests for DATABASE_URL scheme parsing and deduplication

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@jontsai left a comment


All previous comments addressed ✅

  • Secret redaction with SENSITIVE_KEYS set — passwords, tokens, connection strings all masked as ***
  • DATABASE_URL scheme parsing maps to correct DB types (postgresql, mysql, redshift, sqlite)
  • Python test file now in sync with TS: includes SENSITIVE_KEYS, DATABASE_URL_SCHEME_MAP, and redaction tests
  • Tool version checks parallelized with Promise.all

Clean PR, great test coverage (71 TS + 24 Python). LGTM 🚀


@setu-altimateai (bot) left a comment


LGTM ✅

Final review complete. The previously flagged issue (Python test_env_detect.py out of sync with TS) is resolved:

  • ✅ DATABASE_URL now uses scheme-based detection matching TS (postgresql, postgres, mysql, mysql2, redshift, sqlite)
  • ✅ Sensitive values (password, access_token, connection_string, private_key_path) properly redacted with "***"
  • ✅ Python reference implementation mirrors TS detectEnvVars faithfully
  • ✅ Comprehensive test coverage across all warehouse types
  • ✅ project-scan.ts is well-structured with proper deduplication logic
  • ✅ Docs updated consistently

Ship it! 🚀

@anandgupta42 merged commit 05b3d42 into main on Mar 4, 2026
4 checks passed
@kulvirgit deleted the feat/init-command branch on March 10, 2026
anandgupta42 added a commit that referenced this pull request Mar 17, 2026
* feat: replace /init with data stack setup command

Replace the AGENTS.md-generating /init command with a comprehensive data
stack scanner that detects dbt projects, warehouse connections, Docker
databases, installed tools, and config files. The AI agent then walks
the user through adding connections, testing them, and indexing schemas.

New project_scan tool with 5 exported detection functions:
- detectGit: branch, remote URL
- detectDbtProject: dbt_project.yml, manifest, packages
- detectEnvVars: Snowflake, BigQuery, Databricks, Postgres, MySQL, Redshift
- detectDataTools: dbt, sqlfluff, airflow, dagster, prefect, soda, sqlmesh, great_expectations, sqlfmt
- detectConfigFiles: .altimate-code/, .sqlfluff, .pre-commit-config.yaml

Tests: 71 TypeScript (bun:test) + 24 Python (pytest)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: restore /init, rename data stack setup to /discover

Restore the original /init command (creates AGENTS.md) and move the
data stack setup functionality to /discover instead. Updates all docs
to reference /discover as the recommended first-run command.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: make detectGit test resilient to CI detached HEAD

GitHub Actions checks out in detached HEAD state, so
git branch --show-current returns empty. The test now
accepts undefined branch in that case.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: treat empty git branch as undefined in detectGit

In CI detached HEAD, git branch --show-current returns an empty
string. Convert empty string to undefined so callers get a clean
undefined instead of "".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove branch type assertion from detectGit repo test

The "detects a git repository" test should only assert isRepo.
Branch validation is handled by the dedicated branch test which
accounts for CI detached HEAD returning undefined.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review — redact secrets, parse DATABASE_URL scheme, parallelize tool checks

- Redact sensitive env var values (password, access_token, connection_string)
  in scan output with "***" so secrets are never sent to the LLM
- Parse DATABASE_URL scheme (postgresql://, mysql://, etc.) to detect the
  correct database type instead of assuming Postgres
- Parallelize tool version checks with Promise.all instead of sequential loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: sync Python reference implementation with TypeScript changes

- Remove DATABASE_URL from postgres signals, add scheme-based detection
- Add secret redaction (password, access_token, connection_string)
- Add tests for DATABASE_URL scheme parsing and deduplication

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>